We will continue to develop our understanding of ggplot. You should be familiar with the basics covered so far to proceed.
When you practice coding, you will encounter a lot of errors. Error messages may seem mysterious, but they are not random. We have already seen the problems that arise when a constant value, such as a color name, is placed inside aes(), where ggplot treats it as a variable to be mapped rather than as a fixed setting.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = "pink"))
p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()
## `geom_smooth()` using formula 'y ~ x'
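To make the difference concrete, here is a minimal sketch that compares mapping a constant with setting it. The data frame `toy` and its columns are hypothetical stand-ins for gapminder, and `layer_data()` is used only to inspect which colour each layer ends up drawing.

```r
library(ggplot2)

# Hypothetical toy data standing in for gapminder
toy <- data.frame(gdp = c(500, 2000, 8000, 30000),
                  life = c(45, 58, 69, 80))

# Inside aes(): "pink" is treated as a one-level variable and sent
# through the colour scale, so the points are not drawn in pink
p_mapped <- ggplot(toy, aes(x = gdp, y = life, color = "pink")) +
  geom_point()

# Outside aes(): the colour is set to a constant for the whole layer
p_set <- ggplot(toy, aes(x = gdp, y = life)) +
  geom_point(color = "pink")

# Colours actually used by each layer
mapped_colours <- unique(layer_data(p_mapped)$colour)
set_colours <- unique(layer_data(p_set)$colour)
print(mapped_colours)
print(set_colours)
```

Only the second version reports "pink"; the first reports whatever the default colour scale assigns to a single category.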
In this lecture, we will discuss some useful features of ggplot that also commonly cause trouble.
Some keywords you need to know:
The ggplot library is an implementation of the grammar of graphics, an idea developed by Wilkinson (2005). The grammar consists of several rules. If you break a rule, R throws an error and produces no plot. For example, you might omit the + sign between the ggplot object and a geom_ function. This is usually called a syntax error, and it is easy to catch. Other times, you make a mistake in your code without breaking any rule, or you use the wrong information, for example a different column as input. How can we handle these cases?
Let’s see some common errors.
Go back to the gapminder dataset. Suppose we want to plot each country’s GDP per capita over time. We have the year, gdpPercap, and country variables, so running
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line()
should, we might expect, give us each country’s trend line. Let’s see:
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line()
You can guess what geom_line() does from:
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_point()
gapminder %>% arrange(year) %>% head(10)
## # A tibble: 10 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Albania Europe 1952 55.2 1282697 1601.
## 3 Algeria Africa 1952 43.1 9279525 2449.
## 4 Angola Africa 1952 30.0 4232095 3521.
## 5 Argentina Americas 1952 62.5 17876956 5911.
## 6 Australia Oceania 1952 69.1 8691212 10040.
## 7 Austria Europe 1952 66.8 6927772 6137.
## 8 Bahrain Asia 1952 50.9 120447 9867.
## 9 Bangladesh Asia 1952 37.5 46886859 684.
## 10 Belgium Europe 1952 68 8730405 8343.
While ggplot will make a pretty good guess as to the structure
of the data, it does not know that the yearly observations in the data
are grouped by country. We have to tell it. Without that information,
geom_line() orders the observations by year and connects them all in a
single line, so it joins every 1952 observation (starting with the first
row of the data) before moving on to the next year.
When you produce a plot but it looks weird, the problem is most likely in the mapping between the data and aesthetics for the geom_ being used.
In this case, we can use the group argument in aes() to tell ggplot explicitly about this structure.
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country))
It looks rough, but you can see that each line represents one
country’s GDP per capita over time.
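The effect of the group aesthetic can be checked without looking at the plot. The sketch below uses a hypothetical two-country data frame `toy` (not gapminder itself) and `layer_data()` to count how many groups, and therefore how many separate lines, geom_line() will draw.

```r
library(ggplot2)

# Hypothetical toy data: two countries observed in three years
toy <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year = rep(c(1952, 1957, 1962), times = 2),
  gdpPercap = c(800, 850, 900, 9000, 9500, 10000)
)

p <- ggplot(toy, aes(x = year, y = gdpPercap))

# No grouping: every row belongs to one group, so a single line
# zigzags between the two countries within each year
ungrouped <- layer_data(p + geom_line())

# Grouped by country: one group, and therefore one line, per country
grouped <- layer_data(p + geom_line(aes(group = country)))

n_groups_ungrouped <- length(unique(ungrouped$group))
n_groups_grouped <- length(unique(grouped$group))
print(c(ungrouped = n_groups_ungrouped, grouped = n_groups_grouped))
```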
Think about what is happening here:
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = continent))
The plot grouped by country is the one we want, but it is messy. Splitting it into several small panels would help, because it allows a lot of information to be presented compactly and comparably. This is called faceting the data by another variable. We have the continent variable, which splits the data into five groups.
The facet_wrap() function can take a series of arguments, but the most important is the first one.
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country)) + facet_wrap(~ continent)
The tilde (~) introduces a formula in R. For facet_wrap() the formula
is usually one-sided: most of the time, you will just want a single
variable on the right side of the formula.
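As a small illustration of the one-sided formula, the sketch below facets a hypothetical three-continent data frame `toy` in two ways: with the formula `~ continent` and with `vars(continent)`, an alternative spelling available in ggplot2 3.0.0 and later. Both should produce one panel per continent.

```r
library(ggplot2)

# Hypothetical toy data: three continents, two years each
toy <- data.frame(
  continent = rep(c("Africa", "Asia", "Europe"), each = 2),
  year = rep(c(1952, 2007), times = 3),
  gdpPercap = c(1200, 3000, 5000, 12000, 5600, 25000)
)

p <- ggplot(toy, aes(x = year, y = gdpPercap)) + geom_line()

# Two equivalent ways of naming the faceting variable
p_formula <- p + facet_wrap(~ continent)
p_vars <- p + facet_wrap(vars(continent))

# Number of panels that receive data in each version
panels_formula <- length(unique(layer_data(p_formula)$PANEL))
panels_vars <- length(unique(layer_data(p_vars)$PANEL))
print(c(formula = panels_formula, vars = panels_vars))
```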
Each facet is labeled at the top. The overall layout minimizes the duplication of axis labels and other scales. In fact, we can add the features we have learned in each facet. Let’s develop:
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country), color = "honeydew3") +
facet_wrap(~ continent, ncol = 5)
# see where color argument is located!
# what's the role of ncol argument?
Add smoother
p + geom_line(aes(group = country), color = "honeydew3") +
facet_wrap(~ continent, ncol = 5) +
geom_smooth(size = 1 , method = "loess", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# check out the arguments in geom_smooth
Scale
p + geom_line(aes(group = country), color = "honeydew3") +
facet_wrap(~ continent, ncol = 5) +
geom_smooth(size = 1 , method = "loess", se = FALSE) +
scale_y_log10(labels=scales::dollar)
## `geom_smooth()` using formula 'y ~ x'
Add labels
p + geom_line(aes(group = country), color = "honeydew3") +
facet_wrap(~ continent, ncol = 3) +
geom_smooth(size = 1 , method = "loess", se = FALSE) +
scale_y_log10(labels=scales::dollar) +
labs(x = "Year", y = "GDP per capita", title = "GDP per capita on Five Continents")
## `geom_smooth()` using formula 'y ~ x'
The facet_wrap() function is best used when you want a series of small multiples based on a single categorical variable. Your panels will be laid out in order and then wrapped into a grid.
facet_grid() is useful when you want to facet your data by two or more categorical variables. See the Link. Let’s use another data file. The following command would throw an error on your machine: why?
setwd("~/Documents/ibs_course/BUS240/data")
load('gss_sm.rda')
head(gss_sm, 10)
## # A tibble: 10 × 32
## year id ballot age childs sibs degree race sex region incom…¹ relig
## <dbl> <dbl> <labe> <dbl> <dbl> <lab> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 2016 1 1 47 3 2 Bache… White Male New E… $17000… None
## 2 2016 2 2 61 0 3 High … White Male New E… $50000… None
## 3 2016 3 3 72 2 3 Bache… White Male New E… $75000… Cath…
## 4 2016 4 1 43 4 3 High … White Fema… New E… $17000… Cath…
## 5 2016 5 3 55 2 2 Gradu… White Fema… New E… $17000… None
## 6 2016 6 2 53 2 2 Junio… White Fema… New E… $60000… None
## 7 2016 7 1 50 2 2 High … White Male New E… $17000… None
## 8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000… Cath…
## 9 2016 9 1 45 3 5 High … Black Male Middl… $60000… Prot…
## 10 2016 10 3 71 4 1 Junio… White Male Middl… $60000… None
## # … with 20 more variables: marital <fct>, padeg <fct>, madeg <fct>,
## # partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
## # zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
## # agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
## # bigregion <fct>, partners_rc <fct>, obama <dbl>, and abbreviated variable
## # name ¹income16
This is a sample from the 2016 General Social Survey. Compared to gapminder, this data set contains many categorical variables. Play around with it to see what information is available.
See the following:
p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point()
## Warning: Removed 18 rows containing missing values (geom_point).
This shows the relationship between the age of the respondent and the number of children they have. We will then facet this relationship by sex and race of the respondent.
p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point()+facet_grid(sex ~ race)
## Warning: Removed 18 rows containing missing values (geom_point).
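In facet_grid() the formula is two-sided: the variable on the left defines the rows and the variable on the right defines the columns. The sketch below uses a hypothetical data frame `toy` (not gss_sm) containing every combination of two sexes and three races, so the grid should have 2 x 3 panels.

```r
library(ggplot2)

# Hypothetical toy data with every combination of sex and race
toy <- expand.grid(sex = c("Male", "Female"),
                   race = c("White", "Black", "Other"),
                   age = c(30, 60))
toy$childs <- seq_len(nrow(toy)) %% 4  # made-up values, for illustration only

p <- ggplot(toy, aes(x = age, y = childs)) +
  geom_point() +
  facet_grid(sex ~ race)  # rows ~ columns

# One panel per sex-by-race combination
n_panels <- length(unique(layer_data(p)$PANEL))
print(n_panels)
```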
Add details to the scatterplot
p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point(alpha = .3)+facet_grid(sex ~ race) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar()
There is only one mapping here. Note the y-axis: count is not a variable in the data set. In fact, geom_bar() calls stat_count() internally, which counts the number of observations for each value of x. This function also calculates proportions.
p + geom_bar(mapping = aes(y = ..prop..))
Ignore the figure for now and just see how we access these internal calculations. We need to ask for the prop statistic. When ggplot calculates the count or the proportion, it returns temporary variables that we can use as mappings in our plots. To make sure these temporary variables are not confused with others we are working with, their names are wrapped in two periods on each side, so the mapping takes the form y = ..statistic.., as in y = ..prop.. above.
The figure still does not look right, though. This is because of grouping.
p + geom_bar(mapping = aes(y = ..prop.., group = 2))
We need to force ggplot to use the whole dataset instead of each x-category when calculating proportions. Here group = 2 is just a kind of “dummy group”. You can use any constant, for example group = 1 or group = "pink".
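The role of the dummy group can be verified numerically. The sketch below uses a hypothetical six-row data frame `toy` and compares the proportions ggplot computes with and without a dummy group against a hand calculation with table() and prop.table(). It writes the statistic as after_stat(prop), which is the newer spelling of ..prop.. in ggplot2 3.3.0 and later; this is an assumption about your installed version.

```r
library(ggplot2)

# Hypothetical toy data: six respondents in three regions
toy <- data.frame(region = c("Midwest", "Midwest", "South",
                             "West", "West", "West"))

# What stat_count() does, by hand: count, then divide by the total
counts <- table(toy$region)
props <- as.numeric(prop.table(counts))

p <- ggplot(toy, aes(x = region))

# Without a dummy group, each bar is its own group,
# so every proportion is computed relative to itself
no_group <- layer_data(p + geom_bar(aes(y = after_stat(prop))))

# With a dummy group, proportions are taken over the whole data set
one_group <- layer_data(p + geom_bar(aes(y = after_stat(prop), group = 1)))

print(no_group$prop)
print(one_group$prop)
print(props)
```

The ungrouped version returns 1 for every bar, which is why the first figure looks wrong, while the grouped version matches the hand calculation.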
Color?
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, color = bigregion))
p + geom_bar()
Mapping color only changes the outline of the bars. Note that fill is for painting the insides of shapes (remember ribbons?).
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = bigregion))
p + geom_bar()
Take a look
table(gss_sm$region)
##
## New England Middle Atlantic E. Nor. Central W. Nor. Central South Atlantic
## 175 313 502 193 550
## E. Sou. Central W. Sou. Central Mountain Pacific
## 205 297 235 397
Suppose we want to look at religious preference by census region. You might recall the color argument in aes(). Good, but we need to use fill.
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar()
Note that region of the country is on the x-axis, and counts of religious preference are stacked within the bars.
To see the relative share:
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = 'fill')
Note that we set the position argument of geom_bar() to "fill"; this is not the same thing as the fill aesthetic in aes().
What if we want to show separate bars instead of stacked ones?
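What position = "fill" displays can be reproduced by hand in base R. The sketch below uses a hypothetical eight-row data frame `toy` (not gss_sm): the cross-tabulation gives the heights of the stacked segments, and the row-wise proportions give the relative shares, which sum to one within each region.

```r
# Hypothetical toy data: region and religion for eight respondents
toy <- data.frame(
  bigregion = c("South", "South", "South", "South",
                "West", "West", "West", "West"),
  religion = c("Protestant", "Protestant", "Catholic", "None",
               "Protestant", "Catholic", "None", "None")
)

# Counts of religion within region: the heights of the stacked segments
counts <- table(toy$bigregion, toy$religion)

# Row-wise shares: the quantity a filled bar chart displays
shares <- prop.table(counts, margin = 1)
print(shares)

# Each region's shares add up to 1
region_totals <- rowSums(shares)
print(region_totals)
```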
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = 'dodge')
To show proportions instead of counts:
p + geom_bar(position = 'dodge',
mapping = aes(y = ..prop..))
p + geom_bar(position = 'dodge',
mapping = aes(y = ..prop.., group = religion))
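When we just wanted the overall proportions for one variable, we mapped group to a dummy constant to tell ggplot to calculate the proportions with respect to the overall N. In this case our grouping variable is religion, so we might try mapping that to the group aesthetic. The result is that the bars for each religion sum to one across regions, rather than the religions summing to one within each region.
Or you can let the facet function do the work: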
p <- ggplot(data = gss_sm, mapping = aes(x = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop.., group = bigregion)) +
facet_wrap(~ bigregion, ncol = 2)
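To check that faceting gives proportions within each region, the sketch below repeats the same construction on a hypothetical eight-row data frame `toy` and sums the computed proportions panel by panel. It uses after_stat(prop), the newer spelling of ..prop.. (ggplot2 3.3.0 and later, an assumption about your installed version).

```r
library(ggplot2)

# Hypothetical toy data: region and religion for eight respondents
toy <- data.frame(
  bigregion = c("South", "South", "South", "South",
                "West", "West", "West", "West"),
  religion = c("Protestant", "Protestant", "Catholic", "None",
               "Protestant", "Catholic", "None", "None")
)

p <- ggplot(toy, aes(x = religion)) +
  geom_bar(aes(y = after_stat(prop), group = bigregion)) +
  facet_wrap(~ bigregion)

# One row per bar; PANEL identifies the facet the bar is drawn in
bars <- layer_data(p)

# Within each panel (region) the proportions add up to 1
panel_totals <- tapply(bars$prop, bars$PANEL, sum)
print(panel_totals)
```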
What is a histogram?
head(midwest, 10)
## # A tibble: 10 × 28
## PID county state area poptotal popden…¹ popwh…² popbl…³ popam…⁴ popas…⁵
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98 249
## 2 562 ALEXANDER IL 0.014 10626 759 7054 3496 19 48
## 3 563 BOND IL 0.022 14991 681. 14477 429 35 16
## 4 564 BOONE IL 0.017 30806 1812. 29344 127 46 150
## 5 565 BROWN IL 0.018 5836 324. 5264 547 14 5
## 6 566 BUREAU IL 0.05 35688 714. 35157 50 65 195
## 7 567 CALHOUN IL 0.017 5322 313. 5298 1 8 15
## 8 568 CARROLL IL 0.027 16805 622. 16519 111 30 61
## 9 569 CASS IL 0.024 13437 560. 13384 16 8 23
## 10 570 CHAMPAIGN IL 0.058 173025 2983. 146506 16559 331 8033
## # … with 18 more variables: popother <int>, percwhite <dbl>, percblack <dbl>,
## # percamerindan <dbl>, percasian <dbl>, percother <dbl>, popadults <int>,
## # perchsd <dbl>, percollege <dbl>, percprof <dbl>, poppovertyknown <int>,
## # percpovertyknown <dbl>, percbelowpoverty <dbl>, percchildbelowpovert <dbl>,
## # percadultpoverty <dbl>, percelderlypoverty <dbl>, inmetro <int>,
## # category <chr>, and abbreviated variable names ¹popdensity, ²popwhite,
## # ³popblack, ⁴popamerindian, ⁵popasian
midwest is a dataset that ships with ggplot2; it contains information on counties in the Midwest.
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p + geom_histogram(bins = 10)
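A histogram chops a continuous variable into bins and counts the observations that fall into each one, so the choice of bins changes the picture but not the total. The sketch below illustrates this with a hypothetical ten-value data frame `toy` (not the midwest data) and inspects the bins through `layer_data()`.

```r
library(ggplot2)

# Hypothetical toy data: ten made-up area values
toy <- data.frame(area = c(0.010, 0.015, 0.020, 0.022, 0.030,
                           0.031, 0.033, 0.040, 0.055, 0.090))

p <- ggplot(toy, aes(x = area))

# The same data binned coarsely and finely
coarse <- layer_data(p + geom_histogram(bins = 5))
fine <- layer_data(p + geom_histogram(bins = 20))

# Each row is one bin: its limits and the number of observations in it
print(coarse[, c("xmin", "xmax", "count")])

# However the bins are chosen, every observation is counted exactly once
total_coarse <- sum(coarse$count)
total_fine <- sum(fine$count)
print(c(coarse = total_coarse, fine = total_fine))
```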
While histograms summarize single variables, it is also possible to use several at once to compare distributions. We can facet histograms by some variable of interest, or as here we can compare them in the same plot using the fill aesthetic.
two_states <- c("IL", "MI")
p <- ggplot(data = subset(midwest, subset = state %in% two_states),
mapping = aes(x = percollege, fill = state))
p + geom_histogram(alpha = 0.4, bins = 20)
We subset the data here to pick out just two states, Illinois and Michigan. First we create a character vector with the two state abbreviations. Then we use the subset() function to filter the data so that we only select rows whose state name is in this vector. The %in% operator is a convenient way to filter on more than one term in a variable when using subset().
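The filtering step can be tried on its own. This short sketch shows what %in% returns for a small vector and then confirms which states remain after subsetting midwest; the object names `two_state_data` and `kept_states` are introduced here only for illustration.

```r
library(ggplot2)  # provides the midwest data set

two_states <- c("IL", "MI")

# %in% returns TRUE for each element on the left that appears on the right
in_vector <- c("IL", "OH", "MI") %in% two_states
print(in_vector)  # TRUE FALSE TRUE

# Keep only the rows whose state is in the vector
two_state_data <- subset(midwest, subset = state %in% two_states)
kept_states <- sort(unique(two_state_data$state))
print(kept_states)
```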
We can similarly plot a density estimate.
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_density()
p <- ggplot(data = midwest, mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3)
p <- ggplot(data = subset(midwest, subset = state %in% two_states),
mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3, mapping = (aes(y = ..scaled..)))
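The scaled statistic used in the last plot is the density estimate rescaled so that each curve peaks at one, which makes curves for different groups easier to compare. The sketch below checks this on the same two states, writing the statistic as after_stat(scaled), the newer spelling of ..scaled.. (ggplot2 3.3.0 and later, an assumption about your installed version).

```r
library(ggplot2)  # provides the midwest data set

two_states <- c("IL", "MI")

p <- ggplot(data = subset(midwest, subset = state %in% two_states),
            mapping = aes(x = area, fill = state, color = state)) +
  geom_density(alpha = 0.3, mapping = aes(y = after_stat(scaled)))

# One row per point on each curve; group identifies the state
curves <- layer_data(p)

# The highest point of each state's scaled curve
peak_by_state <- tapply(curves$scaled, curves$group, max)
print(peak_by_state)
```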
setwd("~/Documents/ibs_course/BUS240/data")
load('titanic.rda')
head(titanic, 10)
## fate sex n percent
## 1 perished male 1364 62.0
## 2 perished female 126 5.7
## 3 survived male 367 16.7
## 4 survived female 344 15.6
p <- ggplot(data = titanic, mapping = aes(x = fate, y = percent, fill = sex))
#p + geom_bar(position = "dodge")
p + geom_bar(position = "dodge", stat = "identity")
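The commented-out line fails because geom_bar() counts rows by default and does not accept a y aesthetic when doing so; stat = "identity" tells it to use the values in the data as they are. A sketch of an equivalent shortcut is geom_col(), which uses the identity stat by default. The data frame below is re-typed from the table printed above so that the example runs without titanic.rda.

```r
library(ggplot2)

# The titanic table as printed above, re-entered by hand
titanic <- data.frame(
  fate = c("perished", "perished", "survived", "survived"),
  sex = c("male", "female", "male", "female"),
  n = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

p <- ggplot(data = titanic, mapping = aes(x = fate, y = percent, fill = sex))

# Two ways of plotting pre-computed values as bar heights
with_identity <- layer_data(p + geom_bar(position = "dodge", stat = "identity"))
with_col <- layer_data(p + geom_col(position = "dodge"))

# Both layers use the percent column directly as the bar height
print(with_identity$y)
print(with_col$y)
```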